Capturing text semantics for concept detection in news video
Authors
Abstract
The overwhelming amount of multimedia content has triggered the need for automatic semantic concept detection. However, because the visual feature space varies widely, text from automatic speech recognition (ASR) has been used extensively and found to complement visual features effectively in the concept detection task. There are two common text analysis methods, text classification and text retrieval, and each has its own strengths and weaknesses. In addition, the fusion of text and visual analysis is still an open problem. In this paper, we present a novel multi-resolution, multi-source and multi-modal (M3) transductive learning framework. We fuse text and visual features via a multi-resolution model, because features from different modalities work well only at different temporal resolutions, which exhibit different types of semantics. We perform a multi-resolution analysis at the shot, multimedia discourse and story levels to capture the semantics in news video. While visual features play a dominant role at the shot level, text plays an increasingly important role as we move from the multimedia discourse level towards the story level. Our multi-source inference transductive model provides a way to combine the text classification and text retrieval methods. We test our M3 transductive model on semantic concept detection using the TRECVID 2004 dataset. Preliminary results demonstrate that our approach is effective.
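To make the idea of resolution-dependent fusion concrete, here is a minimal sketch of a weighted late fusion across the three temporal levels named in the abstract. It is not the M3 transductive model itself, which the abstract does not specify; the weights in `TEXT_WEIGHT`, the `Scores` type, and the functions `fuse` and `fuse_all` are all hypothetical and chosen only to show text gaining influence from the shot level to the story level.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical text weights per temporal resolution (not taken from the paper):
# visual evidence dominates at the shot level, text at the story level.
TEXT_WEIGHT: Dict[str, float] = {"shot": 0.2, "discourse": 0.5, "story": 0.8}


@dataclass
class Scores:
    visual: float  # visual-detector confidence in [0, 1]
    text: float    # ASR-text confidence in [0, 1]


def fuse(level: str, s: Scores) -> float:
    """Convex combination of visual and text scores for one resolution."""
    w = TEXT_WEIGHT[level]
    return (1.0 - w) * s.visual + w * s.text


def fuse_all(scores: Dict[str, Scores]) -> float:
    """Average the per-level fused scores into one concept confidence."""
    fused = [fuse(level, s) for level, s in scores.items()]
    return sum(fused) / len(fused)
```

With these assumed weights, a strong visual score with no text support contributes far more at the shot level than at the story level. In practice such weights would have to be learned or tuned per concept rather than fixed by hand.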
Similar resources
A Multi-Pronged Approach to Improving Semantic Extraction of News Video
In this paper we describe a multi-strategy approach to improving semantic extraction from news video. Experiments show the value of careful parameter tuning, exploiting multiple feature sets and multilingual linguistic resources, applying text retrieval approaches for image features, and establishing synergy between multiple concepts through undirected graphical models. We present a discriminat...
Automatic Indexing and Retrieval of Large Broadcast News Video Collections - The TRECVID Experience
Most existing operational systems rely purely on automatic speech recognition (ASR) text as the basis for news video indexing and retrieval. While current research shows that ASR text has been the most influential component, results of large scale news video processing experiments indicate that the use of other modality features and external information sources such as the Web is essential in v...
Personalized news through content augmentation and profiling
This paper is concerned with the topic of personalized news assembly at the set-top box, based on augmented video. This is video complemented with additional information that is somehow relevant to the semantic video content. We touch upon the technique that is used for video augmentation, which is video subject detection followed by information searches on the subject. A focus of this paper is...
Broadcast News Story Boundary Detection Using Visual, Audio and Text Features
News video story segmentation is vital for video summarization, story linking, and curation. We present a multimodal segmentation algorithm which fuses video, audio and text cues for story boundary detection. We show that broadcast news closed captioning is a rich and readily available source that improves story boundary detection. Furthermore, we propose an empirical distribution-based feature...
Arabic Text Detection in News Video Based on Line Segment Detector
Text embedded in video sequences is very important to semantic indexing and content-based retrieval systems, especially for large-scale news collections. However, its detection and extraction are still an open problem due to the variety of text sizes and the complexity of the backgrounds. In this paper, we propose an approach for automatic Arabic-text localization based on a novel method for text-li...